Abstract Object tracking is a ubiquitous problem that appears in many applications such as remote
sensing, audio processing, computer vision, human-machine interfaces, human-robot interaction, etc.
Although thoroughly investigated in computer vision, tracking a time-varying number of persons
remains a challenging open problem. In this paper, we propose an on-line variational Bayesian model for
multi-person tracking from cluttered visual observations provided by person detectors. The paper has 
the following contributions. We propose a variational Bayesian framework for tracking an unknown 
and varying number of persons. Our model results in a variational expectation-maximization (VEM) 
algorithm with closed-form expressions both for the posterior distributions of the latent variables and 
for the estimation of the model parameters. The proposed model exploits observations from multiple 
detectors, and it is therefore multimodal by nature. Finally, we propose to embed both object-birth 
and object-visibility processes in an effort to robustly handle temporal appearances and
disappearances. Evaluated on classical multiple person tracking datasets, our method shows competitive results
with respect to state-of-the-art multiple-object tracking algorithms, such as the probability hypothesis 
density (PHD) filter, among others. 

Keywords Multi-person tracking • Bayesian tracking • variational expectation-maximization • causal 
inference • person detection 


1 Introduction 

The problem of tracking a varying number of objects is ubiquitous in a number of fields such as remote 
sensing, computer vision, human-computer interaction, human-robot interaction, etc. While several 
off-line multi-object tracking methods are available, on-line multi-person tracking is still extremely 
challenging [1]. In this paper we propose an on-line tracking method within the tracking-by-detection 
(TbD) paradigm, which gained popularity in the computer vision community thanks to the
development of efficient and robust object detectors [2]. Moreover, one advantage of the TbD paradigm is the
possibility of using linear mappings to link the kinematic (latent) states of the tracked objects to the 
observations issued from the detectors. This is possible because object detectors efficiently capture and 
filter out the non-linear effects, thus delivering detections that are linearly related to the kinematic 
latent states.

In addition to the difficulties associated with single-object tracking (occlusions, self-occlusions, visual
appearance variability, unpredictable temporal behavior, etc.), tracking a varying and unknown number
of objects makes the problem more challenging for the following reasons: (i) the observations
coming from detectors need to be associated to the objects that generated them, which includes the
process of discarding detection errors, (ii) the number of objects is not known in advance and hence
it must be estimated, (iii) mutual occlusions (not present in single-object tracking scenarios) must be
robustly handled, (iv) when many objects are present the dimension of the state-space is large, and hence the
tracker has to handle a large number of hidden-state parameters, (v) the number of objects varies over
time and one has to deal with hidden states of varying dimensionality, from zero when there is no
visible object, to a large number of detected objects. Note that in this case, and if a Bayesian setting
is considered, as is often the case, the exact recursive filtering solution is intractable.

In computer vision, previously proposed methodological frameworks for multi-target tracking can 
be divided into three groups. Firstly, the trans-dimensional Markov chain model [3], where the
dimensionality of the hidden state-space is part of the state variable. This allows tracking a variable number
of objects by jointly estimating the number of objects and their kinematic states. In computer vision
scenarios, this framework has been exploited for tracking a varying number of objects. The main drawback
is that the states are inferred by means of reversible jump Markov chain Monte Carlo sampling, which
is computationally expensive [7]. Secondly, a random finite set multi-target tracking formulation was
proposed. Initially used for radar applications [8], in this framework the targets are modeled as
realizations of a random finite set which is composed of an unknown number of elements. Because an
exact solution to this model is computationally intensive, an approximation known as the probability
hypothesis density (PHD) filter was proposed [11]. Further sampling-based approximations of random
set based filters were subsequently proposed. These were exploited for tracking a
time-varying number of active speakers using auditory cues and for multi-target tracking using
visual observations. Thirdly, conditional random fields (CRF) were also chosen to address multi-target
tracking. In this case, tracking is cast as an energy minimization problem. In another
line of research, in radar tracking, other popular multi-target tracking models are joint probabilistic
data association (JPDA) and multiple hypothesis filters [20].

In this paper we propose an on-line variational Bayesian framework for tracking an unknown and 
varying number of persons. Although variational models are very popular in machine learning, their
use in computer vision for object tracking has been limited to tracking situations involving a fixed
number of targets [21]. Variational Bayes methods approximate the joint a posteriori distribution of
the latent variables by a separable distribution [22, 23]. In an on-line tracking scenario, where only
causal (past) observations can be used, this translates into approximating the filtering distribution.
This is in strong contrast with off-line trackers that use both past and future observations. The
proposed tracking algorithm therefore models the a posteriori distribution of the hidden states given
all past observations. Importantly, the proposed framework leads to closed-form expressions for the
posterior distributions of the hidden variables and for the model parameters, thus yielding an
intrinsically efficient filtering procedure implemented via a variational EM (VEM) algorithm. In addition, a
clutter target is defined so that spurious observations, namely detector failures, are associated to this
target and do not contaminate the filtering process. Furthermore, our formalism allows integrating
in a principled way heterogeneous observations coming from various detectors, e.g., face, upper-body,
silhouette, etc. Remarkably, objects that come in and out of the field of view, namely object
appearance and disappearance, are handled by object birth and visibility processes. In detail, we replace the
classical death process by a visibility process which allows putting to sleep tracks associated with persons
that are no longer visible. The main advantage is that these tracks can be awakened as soon as new
observations match their appearance with high confidence. In summary, the contributions of the paper are:

— Cast the problem of tracking a time-varying number of people into a variational Bayes formulation, 
which approximates the a posteriori filtering distribution by a separable distribution; 

— A VEM algorithm with closed-form expressions, thus inherently efficient, for the update of the a 
posteriori distributions and the estimation of the model parameters from the observations coming 
from different detectors;

— An object-birth and an object-visibility process that handle person appearance and
disappearance due either to occlusions or people leaving the visual scene;

— A thorough evaluation of the proposed method compared with the state-of-the-art in two datasets, 
the cocktail party dataset and a dataset containing several sequences traditionally used in the 
computer vision community to evaluate multi-person trackers. 

The remainder of this paper is organized as follows. Section 2 reviews previous work relevant to
our method. Section 3 details the proposed Bayesian model and a variational solution is
presented in Section 4. In Section 5 we describe the birth and visibility processes that handle an
unknown and varying number of persons. The subsequent sections describe experiments and benchmarks
that assess the quality of the proposed method, and draw some conclusions.


2 Related Work 

Generally speaking, object tracking is the temporal estimation of the object’s kinematic state. In the 
context of image-based tracking, the object state is typically a parametrization of its localization in the 
(2D) image plane. In computer vision, object tracking has been thoroughly investigated [24]. Objects 
of interest could be people, faces, hands, vehicles, etc. According to the considered number of objects 
to be tracked, tracking can be classified into single-object tracking, fixed-number multi-object tracking, 
and varying-number multi-object tracking. 

Methods for single object tracking consider only one object and usually involve an initialization step, 
a state update step, and a reinitialization step allowing to recover from failures. Practical initialization 
steps are based on generic object detectors that scan the input image in order to find the object
of interest. Object detectors can be used for the reinitialization step as well. However, using
generic object detectors is problematic when other objects of the same kind as the tracked object are
present in the visual scene. In order to resolve such ambiguities, different complementary appearance
models have been proposed, such as object templates, color appearance models, edges (image gradients)
and texture (e.g. Gabor features and histograms of gradient orientations). Regarding the update step,
the current state can be estimated from previous states and observations with either deterministic
or probabilistic [28] methods.

Even if it is still a challenging problem, tracking a single object is very limited in scope. The
computer vision community rapidly turned its attention towards fixed-number multi-object tracking.
Additional difficulties are encountered when tracking multiple objects. Firstly, there is an increase of
the tracking state dimensionality, as the multi-object tracking state dimensionality is the single-object
state dimensionality multiplied by the number of tracked objects. Secondly, associations between
observations and objects are required. Since the observation-to-object association problem is
combinatorial [30, 20], it must be carefully addressed when the numbers of objects and of observations are
large. Thirdly, because of the presence of multiple targets, tracking methods also have to be robust to
mutual occlusions.

In most practical applications, the number of objects to be tracked is not only unknown, but
it also varies over time. Importantly, tracking a time-varying number of objects requires an efficient
mechanism to add new objects entering the field of view, and to remove objects that moved away. In a
probabilistic setting, these mechanisms are based on birth and death processes. Efficient multi-object
algorithms have to be developed within principled methodologies that handle hidden states of
varying dimensionality. In computer vision, the most popular methods are based on conditional random
fields [31, 18, 19, 32], on random finite sets, or on the trans-dimensional Markov chain.
An interesting approach explicitly models the occlusion state of a tracked person
in the tracked state and uses it for observation likelihood computation. Less popular but successful
methodologies include the Bayesian multiple blob tracker of [33], the boosted particle filter for
multi-target tracking of [34], the Rao-Blackwellized filter for multiple object tracking [35], and graph-based
representations for multi-object tracking. It should be noted that in other communities, such as
radar tracking, multi-object tracking has been deeply investigated. Many models have been proposed,
such as the probabilistic data association filter (PDAF), the joint PDAF, and multiple hypothesis tracking
[20]. However, there are two main differences between multi-object tracking in radar and in computer vision.
On the one hand, most tracking methods for radar consider point-wise objects, modeling a punctual
latent state, whereas in computer vision objects are represented using bounding boxes in addition to
the punctual coordinates. On the other hand, computer vision applications benefit from the use of
visual appearance, which is mainly used for object identification [38].

Currently available multi-object tracking methods used in computer vision applicative scenarios 
suffer from different drawbacks. CRF-based approaches are naturally non-causal, that is, they use 
both past and future information. Therefore, even if they have shown high robustness to clutter, they 
are only suitable for off-line applications when smoothing (as opposed to filtering) techniques can be
used. PHD filtering techniques report good computational efficiency, but they are inherently limited 
since they provide non-associated tracks. In other words, these techniques require an external method in 
order to associate observations and tracks to objects. Finally, even if trans-dimensional MCMC based 
tracking techniques are able to associate tracks to objects using only causal information, they are 
extremely complex from a computational point of view, and their performance is very sensitive to the 
sampling procedure used. In contrast, the variational Bayesian framework we propose associates tracks 
to previously seen objects and creates new tracks in a unified framework that filters past observations
in an intrinsically efficient way, since all the steps of the algorithm are expressed in closed-form. Hence 
the proposed method robustly and efficiently tracks a varying and unknown number of persons from a 
combination of image detectors. 


3 Variational Bayesian Multiple-Person Tracking 

3.1 Notations and Definitions 

We start by introducing our notations. Vectors and matrices are in bold, $\mathbf{A}$, $\mathbf{a}$; scalars are in italic, $A$, $a$.
In general random variables are denoted with upper-case letters, e.g. $\mathbf{A}$ and $A$, and their realizations
with lower-case letters, e.g. $\mathbf{a}$ and $a$.

Since the objective is to track multiple persons whose number may vary over time, we assume that
there is a maximum number of people, denoted by $N$, that may enter the visual scene. This parameter
is necessary in order to cast the problem at hand into a finite-dimensional state space; consequently,
$N$ can be arbitrarily large. A track $n$ at time $t$ is associated to the existence binary variable $e_{tn}$ taking
the value $e_{tn} = 1$ if the person has already been seen and $e_{tn} = 0$ otherwise. The vectorization of
the existence variables at time $t$ is denoted by $\mathbf{e}_t = (e_{t1}, \dots, e_{tN})^\top$ and their sum, namely the effective
number of tracked persons at $t$, is denoted by $N_t = \sum_{n=1}^{N} e_{tn}$. The existence variables are assumed
to be observed in Sections 3 and 4; their inference, grounded in a track-birth stochastic process, is
discussed in Section 5.

The kinematic state of person $n$ is a random vector $\mathbf{X}_{tn} = (\mathbf{L}_{tn}^\top, \mathbf{U}_{tn}^\top)^\top \in \mathbb{R}^6$, where $\mathbf{L}_{tn} \in \mathbb{R}^4$ is
the person location, i.e., 2D image position, width and height, and $\mathbf{U}_{tn} \in \mathbb{R}^2$ is the person velocity in
the image plane. The multi-person state random vector is denoted by $\mathbf{X}_t = (\mathbf{X}_{t1}^\top, \dots, \mathbf{X}_{tN}^\top)^\top \in \mathbb{R}^{6N}$.
Importantly, the kinematic state is described by a set of hidden variables which must be robustly
estimated.

In order to ease the challenging task of tracking multiple persons with a single static camera, we
assume the existence of $I$ detectors, each of them providing $K_t^i$ localization observations at each time
$t$, with $i \in [1 \dots I]$. Fig. 1 provides examples of face and upper-body detections (see Fig. 1(a)) and of
full-body detections (see Fig. 1(b)). The $k$-th localization observation gathered by the $i$-th detector at
time $t$ is denoted by $\mathbf{y}_{tk}^i \in \mathbb{R}^4$, and represents the location (2D position, width, height) of a person
in the image. The set of observations provided by detector $i$ at time $t$ is denoted by $\mathbf{y}_t^i = \{\mathbf{y}_{tk}^i\}_{k=1}^{K_t^i}$,
and the observations provided by all the detectors at time $t$ are denoted by $\mathbf{y}_t = \{\mathbf{y}_t^i\}_{i=1}^{I}$. Associated to
each localization detection $\mathbf{y}_{tk}^i$, there is a photometric description of the person's appearance, denoted
by $\mathbf{h}_{tk}^i$. This photometric observation is extracted from the bounding box of $\mathbf{y}_{tk}^i$. Altogether, the
localization and photometric observations constitute the raw observations $\mathbf{o}_{tk}^i = (\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i)$ used by our
tracker. Analogous definitions to $\mathbf{y}_t^i$ and $\mathbf{y}_t$ hold for $\mathbf{h}_t^i = \{\mathbf{h}_{tk}^i\}_{k=1}^{K_t^i}$, $\mathbf{h}_t = \{\mathbf{h}_t^i\}_{i=1}^{I}$, $\mathbf{o}_t^i = \{\mathbf{o}_{tk}^i\}_{k=1}^{K_t^i}$ and
$\mathbf{o}_t = \{\mathbf{o}_t^i\}_{i=1}^{I}$. Importantly, when we write the probability of a set of random variables, we refer to the
joint probabilities of all random variables in that set. For instance: $p(\mathbf{o}_t^i) = p(\mathbf{o}_{t1}^i, \dots, \mathbf{o}_{tK_t^i}^i)$.

Fig. 1 Examples of detections used as observations by the proposed person tracker: upper-body, face (a), and
full-body (b) detections. Notice that one of the faces was not detected and that there is a false full-body detection in the
background.

We also define an observation-to-person assignment (hidden) variable $Z_{tk}^i$ associated with each
observation $\mathbf{o}_{tk}^i$. Formally, $Z_{tk}^i$ is a categorical variable taking values in the set $\{1 \dots N\}$: $Z_{tk}^i = n$
means that $\mathbf{o}_{tk}^i$ is associated to person $n$. $\mathbf{Z}_t^i$ and $\mathbf{Z}_t$ are defined in an analogous way to $\mathbf{y}_t^i$ and $\mathbf{y}_t$.
These assignment variables can be easily used to handle false detections. Indeed, it is common that
a detection corresponds to some clutter instead of a person. We cope with these false detections by
defining a clutter target. In practice, the index $n = 0$ is assigned to this clutter target, which is always
visible, i.e. $e_{t0} = 1$ for all $t$. Hence, the set of possible values for $Z_{tk}^i$ is extended to $\{0\} \cup \{1 \dots N\}$,
and $Z_{tk}^i = 0$ means that observation $\mathbf{o}_{tk}^i$ has been generated by clutter and not by a person. The
practical consequence of adding a clutter track is that the observations assigned to it play no role in
the estimation of the parameters of the other tracks, thus leading to estimation rules inherently robust
to outliers.


3.2 The Proposed Bayesian Multi-Person Tracking Model 


The on-line multi-person tracking problem is cast into the estimation of the filtering distribution of
the hidden variables given the causal observations $p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t})$, where $\mathbf{o}_{1:t} = \{\mathbf{o}_1, \dots, \mathbf{o}_t\}$. The
filtering distribution can be rewritten as:

$$p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t}) = \frac{p(\mathbf{o}_t | \mathbf{Z}_t, \mathbf{X}_t, \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})\, p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})}{p(\mathbf{o}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})}. \tag{1}$$

Importantly, we assume that the observations at time $t$ only depend on the hidden and visibility
variables at time $t$. Therefore (1) writes:

$$p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t}) = \frac{p(\mathbf{o}_t | \mathbf{Z}_t, \mathbf{X}_t, \mathbf{e}_t)\, p(\mathbf{Z}_t | \mathbf{e}_t)\, p(\mathbf{X}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})}{p(\mathbf{o}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})}. \tag{2}$$

The denominator of (2) only involves observed variables and therefore its evaluation is not necessary
as long as one can normalize the expression arising from the numerator. Hence we focus on the three
terms of the latter, namely the observation model $p(\mathbf{o}_t | \mathbf{Z}_t, \mathbf{X}_t, \mathbf{e}_t)$, the observation-to-person assignment
prior distribution $p(\mathbf{Z}_t | \mathbf{e}_t)$ and the dynamics of the latent state $p(\mathbf{X}_t | \mathbf{X}_{t-1}, \mathbf{e}_t)$, which appears when
marginalizing the predictive distribution $p(\mathbf{X}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})$ with respect to $\mathbf{X}_{t-1}$. Figure 2 shows a
graphical schematic representation of the proposed probabilistic model.

Fig. 2 Graphical representation of the proposed multi-target tracking probabilistic model.


3.2.1 The Observation Model 


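The joint observations are assumed to be independent and identically distributed:

$$p(\mathbf{o}_t | \mathbf{Z}_t, \mathbf{X}_t, \mathbf{e}_t) = \prod_{i=1}^{I} \prod_{k=1}^{K_t^i} p(\mathbf{o}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, \mathbf{e}_t). \tag{3}$$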

In addition, we make the reasonable assumption that, while localization observations depend both 
on the assignment variable and kinematic state, the appearance observations only depend on the 
assignment variable, that is the person identity, but not on his/her kinematic state. We also assume the 
localization and appearance observations to be independent given the hidden variables. Consequently, 
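the observation likelihood of a single joint observation can be factorized as:

$$p(\mathbf{o}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, \mathbf{e}_t) = p(\mathbf{y}_{tk}^i, \mathbf{h}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, \mathbf{e}_t) = p(\mathbf{y}_{tk}^i | Z_{tk}^i, \mathbf{X}_t, \mathbf{e}_t)\, p(\mathbf{h}_{tk}^i | Z_{tk}^i, \mathbf{e}_t). \tag{4}$$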


The localization observation model is defined depending on whether the observation is generated by
clutter or by a person:

— If the observation is generated from clutter, namely $Z_{tk}^i = 0$, the variable $\mathbf{y}_{tk}^i$ follows a uniform
distribution with probability density function $u(\mathbf{y}_{tk}^i)$;

— If the observation is generated by person $n$, namely $Z_{tk}^i = n$, the variable $\mathbf{y}_{tk}^i$ follows a Gaussian
distribution with mean $\mathbf{P}^i \mathbf{X}_{tn}$ and covariance $\boldsymbol{\Sigma}^i$: $\mathbf{y}_{tk}^i \sim g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)$.

The linear operator $\mathbf{P}^i$ maps the kinematic state vectors onto the $i$-th space of observations. For
example, when $\mathbf{X}_{tn}$ represents the upper-body kinematic state (upper-body localization and velocity)
and $\mathbf{y}_{tk}^i$ represents the upper-body localization observation, $\mathbf{P}^i$ is a projection which, when applied to a
state vector, only retains the localization components of the state vector. When $\mathbf{y}_{tk}^i$ is a face localization
observation, the operator $\mathbf{P}^i$ is a composition of the previous projection and an affine transformation
mapping an upper-body bounding-box onto its corresponding face bounding-box. Finally, the full
observation model is compactly defined by

$$p(\mathbf{y}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, \mathbf{e}_t) = u(\mathbf{y}_{tk}^i)^{1 - e_{tn}} \left( u(\mathbf{y}_{tk}^i)^{\delta_{0n}}\, g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i)^{1 - \delta_{0n}} \right)^{e_{tn}}, \tag{5}$$

where $\delta_{ij}$ stands for the Kronecker function.
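As a minimal sketch of how the compact model (5) can be evaluated, the following Python fragment assumes a diagonal covariance $\boldsymbol{\Sigma}^i$ and an upper-body detector whose operator $\mathbf{P}^i$ simply keeps the four location components of the state. The function names, the image size and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_diag_pdf(y, mean, var):
    """Gaussian density with a diagonal covariance (an assumption of this sketch)."""
    log_p = 0.0
    for yi, mi, vi in zip(y, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (yi - mi) ** 2 / vi
    return math.exp(log_p)

def project_location(x):
    """Hypothetical P^i for an upper-body detector: keep (x, y, width, height)."""
    return x[:4]

def localization_likelihood(y, n, x_n, e_n, var, u_density):
    """Equation (5): uniform density for clutter (n == 0) or a non-existing
    track (e_n == 0), Gaussian density around the projected state otherwise."""
    if e_n == 0 or n == 0:
        return u_density
    return gaussian_diag_pdf(y, project_location(x_n), var)

# Hypothetical numbers: uniform density over positions and sizes in a 640x480 image.
u_density = 1.0 / (640 * 480 * 640 * 480)
x_n = [100.0, 50.0, 40.0, 80.0, 1.0, 0.0]   # (x, y, width, height, vx, vy)
y = [101.0, 50.0, 40.0, 80.0]               # one localization observation
var = [4.0, 4.0, 4.0, 4.0]                  # diagonal of Sigma^i

p_person = localization_likelihood(y, 1, x_n, 1, var, u_density)
p_clutter = localization_likelihood(y, 0, None, 1, var, u_density)
```

With these values the observation is far more likely under the person hypothesis than under clutter, which is the behavior the assignment step relies on.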

The appearance observation model is also defined depending on whether the observation is clutter
or not. When the observation is generated by clutter, the appearance observation follows a uniform
distribution with density function $u(\mathbf{h}_{tk}^i)$. When the observation is generated by person $n$, the appearance
observation follows a Bhattacharyya distribution with density defined as

$$b(\mathbf{h}_{tk}^i; \mathbf{h}_n) = \frac{1}{W_\lambda} \exp\left(-\lambda\, d_B(\mathbf{h}_{tk}^i, \mathbf{h}_n)\right),$$

where $\lambda$ is a positive skewness parameter, $d_B(\cdot, \cdot)$ is the Bhattacharyya distance between histograms, and
$\mathbf{h}_n$ is the $n$-th person's reference appearance model^1. This gives the following compact appearance
observation model:

$$p(\mathbf{h}_{tk}^i | Z_{tk}^i = n, \mathbf{X}_t, \mathbf{e}_t) = u(\mathbf{h}_{tk}^i)^{1 - e_{tn}} \left( u(\mathbf{h}_{tk}^i)^{\delta_{0n}}\, b(\mathbf{h}_{tk}^i; \mathbf{h}_n)^{1 - \delta_{0n}} \right)^{e_{tn}}. \tag{6}$$

3.2.2 The Observation-to-Person Prior Distribution

The joint distribution of the assignment variables factorizes as:

$$p(\mathbf{Z}_t | \mathbf{e}_t) = \prod_{i=1}^{I} \prod_{k=1}^{K_t^i} p(Z_{tk}^i | \mathbf{e}_t). \tag{7}$$


When observations are not yet available, given the existence variables $\mathbf{e}_t$, the assignment variables $Z_{tk}^i$ are
assumed to follow multinomial distributions defined as:

$$p(Z_{tk}^i = n | \mathbf{e}_t) = e_{tn}\, a_{tn}^i \quad \text{with} \quad \sum_{n=0}^{N} e_{tn}\, a_{tn}^i = 1. \tag{8}$$

Because $e_{tn}$ takes the value 1 only for actual persons, the probability to assign an observation to a
non-existing person is null.

3.2.3 The Predictive Distribution

The kinematic state predictive distribution represents the probability distribution of the kinematic
state at time $t$ given the observations up to time $t-1$ and the existence variables, $p(\mathbf{X}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t})$.
The predictive distribution is mainly driven by the dynamics of people's kinematic states, which are
modeled considering two hypotheses. Firstly, the kinematic state dynamics follow a first-order Markov
chain, meaning that the state $\mathbf{X}_t$ only depends on state $\mathbf{X}_{t-1}$. Secondly, the person locations do not
influence each other's dynamics, meaning that there is one first-order Markov chain for each person.
Formally, this can be written as:

$$p(\mathbf{X}_t | \mathbf{X}_{1:t-1}, \mathbf{e}_{1:t}) = p(\mathbf{X}_t | \mathbf{X}_{t-1}, \mathbf{e}_t) = \prod_{n=1}^{N} p(\mathbf{X}_{tn} | \mathbf{X}_{t-1,n}, e_{tn}). \tag{9}$$

The immediate consequence is that the predictive distribution computes:

$$p(\mathbf{X}_t | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t}) = \int \prod_{n=1}^{N} p(\mathbf{X}_{tn} | \mathbf{x}_{t-1,n}, e_{tn})\, p(\mathbf{x}_{t-1} | \mathbf{o}_{1:t-1}, \mathbf{e}_{1:t-1})\, d\mathbf{x}_{t-1}. \tag{10}$$


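For the model to be complete, $p(\mathbf{X}_{tn} | \mathbf{X}_{t-1,n}, e_{tn})$ needs to be defined. The temporal evolution of the
kinematic state $\mathbf{X}_{tn}$ is defined as:

$$p(\mathbf{X}_{tn} = \mathbf{x}_{tn} | \mathbf{X}_{t-1,n} = \mathbf{x}_{t-1,n}, e_{tn}) = u(\mathbf{x}_{tn})^{1 - e_{tn}}\, g(\mathbf{x}_{tn}; \mathbf{D}\mathbf{x}_{t-1,n}, \boldsymbol{\Lambda}_n)^{e_{tn}}, \tag{11}$$

where $u(\mathbf{x}_{tn})$ is a uniform distribution over the motion state space, $g$ is a Gaussian probability density
function, $\mathbf{D}$ represents the dynamics transition operator, and $\boldsymbol{\Lambda}_n$ is a covariance matrix accounting for
uncertainties on the state dynamics. The transition operator is defined as:

$$\mathbf{D} = \begin{pmatrix} \mathbf{I}_{4\times 4} & \begin{matrix} \mathbf{I}_{2\times 2} \\ \mathbf{0}_{2\times 2} \end{matrix} \\ \mathbf{0}_{2\times 4} & \mathbf{I}_{2\times 2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$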


^1 It should be noted that the normalization constant $W_\lambda = \int_{\sum_k h_k = 1} \exp(-\lambda\, d_B(\mathbf{h}, \mathbf{h}_n))\, d\mathbf{h}$ can be exactly computed
only for histograms with dimension lower than 3. In practice $W_\lambda$ is approximated using Monte Carlo integration.

In other words, the dynamics of an existing person $n$ either follows a Gaussian with mean vector
$\mathbf{D}\mathbf{x}_{t-1,n}$ and covariance matrix $\boldsymbol{\Lambda}_n$, or a uniform distribution if person $n$ does not exist. The complete
set of parameters of the proposed model is denoted with $\Theta = \left(\{\boldsymbol{\Sigma}^i\}_{i=1}^{I}, \{\boldsymbol{\Lambda}_n\}_{n=1}^{N}, \mathbf{A}_{1:t}\right)$, with
$\mathbf{A}_t = \{a_{tn}^i\}_{n=0,\, i=1}^{N,\, I}$.
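To make the role of the transition operator concrete, here is a small Python sketch that applies the matrix D to a single six-dimensional state, i.e. it computes the mean of the Gaussian in (11). The helper name and the numerical state are hypothetical.

```python
# Transition operator D from the text: position is advanced by the velocity,
# while width, height and velocity are carried over unchanged.
D = [
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

def predict_state(x_prev):
    """Return D x_{t-1,n}, the predicted mean of the kinematic state."""
    return [sum(d * x for d, x in zip(row, x_prev)) for row in D]

# Hypothetical state: (x, y, width, height, vx, vy).
x_prev = [10.0, 20.0, 5.0, 8.0, 1.0, -2.0]
x_pred = predict_state(x_prev)
```

The operator encodes a constant-velocity motion for the box position and a constant box size, with the covariance $\boldsymbol{\Lambda}_n$ absorbing deviations from this idealized motion.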


4 Variational Bayesian Inference 

Because of the combinatorial nature of the observation-to-person assignment problem, a direct
optimization of the filtering distribution (2) with respect to the hidden variables is intractable. We propose
to overcome this problem via a variational Bayesian inference method. The principle of this family of
methods is to approximate the intractable filtering distribution $p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t})$ by a separable
distribution, e.g. $q(\mathbf{Z}_t) \prod_{n=0}^{N} q(\mathbf{X}_{tn})$. According to the variational Bayesian formulation [22, 23], given
the observations and the parameters at the previous iteration $\Theta^\circ$, the optimal approximation has the
following general expression:

$$\log q(\mathbf{Z}_t) = \mathrm{E}_{q(\mathbf{X}_t)} \left\{ \log p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t}, \Theta^\circ) \right\}, \tag{12}$$

$$\log q(\mathbf{X}_{tn}) = \mathrm{E}_{q(\mathbf{Z}_t) \prod_{m \neq n} q(\mathbf{X}_{tm})} \left\{ \log p(\mathbf{Z}_t, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t}, \Theta^\circ) \right\}. \tag{13}$$

In our particular case, when these two equations are put together with the probabilistic model defined
in Section 3.2, the expression of $q(\mathbf{Z}_t)$ factorizes further into:

$$\log q(Z_{tk}^i) = \mathrm{E}_{q(\mathbf{X}_t)} \left\{ \log p(Z_{tk}^i, \mathbf{X}_t | \mathbf{o}_{1:t}, \mathbf{e}_{1:t}, \Theta^\circ) \right\}. \tag{14}$$

Note that this equation leads to a finer factorization than the one we imposed. This behavior is typical
of variational Bayes methods, in which a very mild separability assumption can lead to a much finer
factorization when combined with the priors over hidden states and latent variables defined in Section 3.2.
The final factorization writes:

$$q(\mathbf{Z}_t, \mathbf{X}_t) = \prod_{i=1}^{I} \prod_{k=1}^{K_t^i} q(Z_{tk}^i) \prod_{n=0}^{N} q(\mathbf{X}_{tn}). \tag{15}$$

Once the posterior distribution over the hidden variables is computed (see below), the optimal
parameters are estimated using $\hat{\Theta} = \arg\max_{\Theta} J(\Theta, \Theta^\circ)$ with $J$ defined as:

$$J(\Theta, \Theta^\circ) = \mathrm{E}_{q(\mathbf{Z}_t, \mathbf{X}_t)} \left\{ \log p(\mathbf{Z}_t, \mathbf{X}_t, \mathbf{o}_{1:t} | \mathbf{e}_{1:t}, \Theta, \Theta^\circ) \right\}. \tag{16}$$


To summarize, the proposed solution for multi-person tracking is an on-line variational EM
algorithm. Indeed, the factorization (15) leads to a variational EM in which the E-step consists of
computing (14) and (13) and the M-step consists of maximizing the expected complete-data log-likelihood
(16) with respect to the parameters. However, as is detailed below, for stability reasons the covariance
matrices are not estimated with the variational inference framework, but set to a fixed value. The
expectation and maximization steps of the algorithm are now detailed.


4.1 E-Z-Step 


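The estimation of $q(Z_{tk}^i)$ is carried out by developing the expectation (14). More derivation details can
be found in Appendix A.2, which yields the following formula:

$$q(Z_{tk}^i = n) = \alpha_{tkn}^i, \tag{17}$$

where

$$\alpha_{tkn}^i = \frac{e_{tn}\, \epsilon_{tkn}^i\, a_{tn}^i}{\sum_{m=0}^{N} e_{tm}\, \epsilon_{tkm}^i\, a_{tm}^i}, \tag{18}$$

and $\epsilon_{tkn}^i$ is defined as:

$$\epsilon_{tkn}^i = \begin{cases} u(\mathbf{y}_{tk}^i)\, u(\mathbf{h}_{tk}^i) & n = 0, \\ g(\mathbf{y}_{tk}^i; \mathbf{P}^i \boldsymbol{\mu}_{tn}, \boldsymbol{\Sigma}^i)\, e^{-\frac{1}{2} \mathrm{Tr}\left( \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{P}^i \boldsymbol{\Gamma}_{tn} \right)}\, b(\mathbf{h}_{tk}^i; \mathbf{h}_n) & n \neq 0, \end{cases} \tag{19}$$

where $\mathrm{Tr}(\cdot)$ is the trace operator and $\boldsymbol{\mu}_{tn}$ and $\boldsymbol{\Gamma}_{tn}$ are defined by (22) and (21) below, respectively. Intuitively, this
approximation shows that the assignment of an observation to a person is based on the spatial proximity
between the observation localization and the person localization, and on the similarity between the
observation's appearance and the person's reference appearance.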
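The normalization in (18) can be illustrated with a short Python sketch. It assumes that the per-track terms of (19) have already been evaluated for one observation; the function name and all numbers are hypothetical.

```python
def assignment_posteriors(e, eps, a):
    """Equation (18) for a single observation: alpha_n is proportional to
    e_n * eps_n * a_n, normalized over the clutter track (index 0) and all persons."""
    weights = [e_n * eps_n * a_n for e_n, eps_n, a_n in zip(e, eps, a)]
    total = sum(weights)  # positive as long as the clutter term is positive
    return [w / total for w in weights]

# Hypothetical values: index 0 is the clutter track, track 2 does not exist.
e = [1, 1, 0, 1]
eps = [0.01, 0.5, 0.9, 0.04]   # assumed outputs of equation (19)
a = [0.2, 0.4, 0.0, 0.4]       # prior weights with sum_n e_n * a_n = 1
alpha = assignment_posteriors(e, eps, a)
```

Track 2 receives zero posterior mass even though its likelihood term is large, because its existence variable is zero; this is the practical effect of the factor $e_{tn}$ in (18).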


4.2 E-X-Step 


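The estimation of $q(\mathbf{X}_{tn})$ is derived from (13). Similarly to the previous posterior distribution, the
mathematical derivations are provided in Appendix A.3 and boil down to the following formula:

$$q(\mathbf{X}_{tn}) = g(\mathbf{X}_{tn}; \boldsymbol{\mu}_{tn}, \boldsymbol{\Gamma}_{tn}), \tag{20}$$

where the mean vector $\boldsymbol{\mu}_{tn}$ and the covariance matrix $\boldsymbol{\Gamma}_{tn}$ are given by

$$\boldsymbol{\Gamma}_{tn} = \left( \sum_{i=1}^{I} \sum_{k=1}^{K_t^i} \alpha_{tkn}^i\, \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{P}^i + \left( \mathbf{D} \boldsymbol{\Gamma}_{t-1,n} \mathbf{D}^\top + \boldsymbol{\Lambda}_n \right)^{-1} \right)^{-1}, \tag{21}$$

$$\boldsymbol{\mu}_{tn} = \boldsymbol{\Gamma}_{tn} \left( \sum_{i=1}^{I} \sum_{k=1}^{K_t^i} \alpha_{tkn}^i\, \mathbf{P}^{i\top} (\boldsymbol{\Sigma}^i)^{-1} \mathbf{y}_{tk}^i + \left( \mathbf{D} \boldsymbol{\Gamma}_{t-1,n} \mathbf{D}^\top + \boldsymbol{\Lambda}_n \right)^{-1} \mathbf{D} \boldsymbol{\mu}_{t-1,n} \right). \tag{22}$$

We note that the variational approximation of the kinematic-state distribution is reminiscent of the Kalman
filter solution of a linear dynamical system, with mainly one difference: in our solution (21) and (22),
the mean vectors and covariance matrices are computed with the observations weighted by the
assignment probabilities $\alpha_{tkn}^i$ computed in the E-Z-step.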
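The following Python sketch specializes (21) and (22) to a one-dimensional state observed by a single detector, so that all matrices reduce to scalars. This scalar setting, the function name and the numbers are assumptions made purely for illustration.

```python
def update_track_1d(mu_prev, gamma_prev, lam, sigma, ys, alphas, d=1.0, p=1.0):
    """Scalar version of (21)-(22): weighted, Kalman-like update of one track.

    mu_prev, gamma_prev: posterior mean and variance at time t-1
    lam, sigma: dynamics and observation variances
    ys, alphas: observations and their assignment probabilities to this track
    d, p: scalar stand-ins for the operators D and P^i
    """
    pred_var = d * gamma_prev * d + lam
    precision = sum(al * p * p / sigma for al in alphas) + 1.0 / pred_var
    gamma = 1.0 / precision
    weighted_obs = sum(al * p * y / sigma for al, y in zip(alphas, ys))
    mu = gamma * (weighted_obs + d * mu_prev / pred_var)
    return mu, gamma

# One observation fully assigned to the track, one outlier with zero weight.
mu, gamma = update_track_1d(0.0, 1.0, 1.0, 2.0, [4.0, 100.0], [1.0, 0.0])

# No observation assigned: the update reduces to the pure prediction step.
mu_pred, gamma_pred = update_track_1d(3.0, 1.0, 1.0, 2.0, [4.0], [0.0])
```

When every weight is zero the posterior equals the prediction, which illustrates how a track can keep evolving under the dynamics while it receives no supporting observation.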


4.3 M-step 

Once the posterior distribution of the hidden variables is estimated, the optimal parameter values
can be estimated via maximization of $J$ defined in (16). The M-step thus provides estimates of the model
parameters.

Regarding the parameters of the a priori observation-to-object assignment $\mathbf{A}_t$, we compute:

$$J(a_{tn}^i) = \sum_{k=1}^{K_t^i} e_{tn}\, \alpha_{tkn}^i \log(e_{tn}\, a_{tn}^i) \quad \text{s.t.} \quad \sum_{n=0}^{N} e_{tn}\, a_{tn}^i = 1, \tag{23}$$

and trivially obtain:

$$a_{tn}^i = \frac{e_{tn} \sum_{k=1}^{K_t^i} \alpha_{tkn}^i}{\sum_{m=0}^{N} e_{tm} \sum_{k=1}^{K_t^i} \alpha_{tkm}^i}. \tag{24}$$
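The update (24) amounts to normalizing, per detector, the posterior assignment mass collected by each existing track. A minimal Python sketch, with a hypothetical function name and made-up assignment probabilities, could look as follows.

```python
def update_assignment_prior(e, alpha):
    """Equation (24) for one detector: alpha[k][n] is the posterior probability
    that observation k is assigned to track n (index 0 is the clutter track)."""
    mass = [e_n * sum(row[n] for row in alpha) for n, e_n in enumerate(e)]
    total = sum(mass)
    return [m / total for m in mass]

# Three observations, clutter plus two tracks, the second track does not exist.
e = [1, 1, 0]
alpha = [
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.0],
    [0.3, 0.7, 0.0],
]
a = update_assignment_prior(e, alpha)
```

Since each row of `alpha` sums to one over the existing tracks, the denominator equals the number of observations of that detector.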


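The M-step for observation covariances corresponds to the estimation of $\boldsymbol{\Sigma}^i$. This is done by
maximizing

$$J(\boldsymbol{\Sigma}^i) = \sum_{k=1}^{K_t^i} \sum_{n=1}^{N} e_{tn}\, \alpha_{tkn}^i\, \mathrm{E}_{q(\mathbf{X}_{tn})} \left\{ \log g(\mathbf{y}_{tk}^i; \mathbf{P}^i \mathbf{X}_{tn}, \boldsymbol{\Sigma}^i) \right\}$$

with respect to $\boldsymbol{\Sigma}^i$. Differentiating $J(\boldsymbol{\Sigma}^i)$ with respect to $\boldsymbol{\Sigma}^i$ and equating to zero gives:

$$\boldsymbol{\Sigma}^i = \frac{\sum_{k=1}^{K_t^i} \sum_{n=1}^{N} e_{tn}\, \alpha_{tkn}^i \left( \mathbf{P}^i \boldsymbol{\Gamma}_{tn} \mathbf{P}^{i\top} + (\mathbf{y}_{tk}^i - \mathbf{P}^i \boldsymbol{\mu}_{tn})(\mathbf{y}_{tk}^i - \mathbf{P}^i \boldsymbol{\mu}_{tn})^\top \right)}{\sum_{k=1}^{K_t^i} \sum_{n=1}^{N} e_{tn}\, \alpha_{tkn}^i}. \tag{25}$$

The M-step for kinematic state dynamics covariances corresponds to the estimation of $\boldsymbol{\Lambda}_n$ for a
fixed $n$. This is done by maximizing a cost function $J(\boldsymbol{\Lambda}_n)$. Equating the differential of the cost
$J(\boldsymbol{\Lambda}_n)$ to zero gives:

$$\boldsymbol{\Lambda}_n = \mathbf{D} \boldsymbol{\Gamma}_{t-1,n} \mathbf{D}^\top + \boldsymbol{\Gamma}_{tn} + (\boldsymbol{\mu}_{tn} - \mathbf{D} \boldsymbol{\mu}_{t-1,n})(\boldsymbol{\mu}_{tn} - \mathbf{D} \boldsymbol{\mu}_{t-1,n})^\top.$$

It is worth noticing that, in the current filtering formalism, the formulas for $\boldsymbol{\Sigma}^i$ and $\boldsymbol{\Lambda}_n$ are
instantaneous, i.e., they are estimated only from the observations at time $t$. The information at time
$t$ is usually insufficient to obtain stable values for these matrices. Even if estimating $\boldsymbol{\Sigma}^i$ and $\boldsymbol{\Lambda}_n$ is
suitable in a parameter learning scenario where the tracks are provided, we noticed that in practical
tracking scenarios, where the tracks are unknown, this does not yield stable results. Suitable priors on
the temporal dynamics of the covariance parameters are required. Therefore, in this paper we assume
that the observation and dynamical model covariance matrices are fixed.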


5 Person-Birth and Person-Visibility Processes 

Tracking a time-varying number of targets requires procedures to create tracks when new targets enter 
the scene and to delete tracks when corresponding targets leave the visual scene. In this paper, we 
propose a statistical-test based birth process that creates new tracks and a hidden Markov model 
(HMM) based visibility process that handles disappearing targets. So far, we assumed that the
existence variables $e_{tn}$ were given. In this section we present the inference model for the existence variables,
based on the stochastic birth process.


5.1 Birth Process 


The principle of the person birth process is to search for consistent trajectories in the history of 
observations associated to clutter. Intuitively, two hypotheses considered observation sequence is 
generated by a person not being tracked^^ and considered observation sequence is generated by 
clutter^^ are confronted. 

The model of the hypothesis "the considered observation sequence is generated by a person not being tracked"
is based on the observation and dynamic models defined earlier. If there is a
not-yet-tracked person n generating the considered observation sequence $\{y_{t-L,k_L}, \ldots, y_{t,k_0}\}$ (see footnote 2), then
the observation likelihood is $p(y_{t-l,k_l} | x_{t-l,n}) = g(y_{t-l,k_l}; P x_{t-l,n}, \Sigma)$ and the person trajectory is
governed by the dynamical model $p(x_{t,n} | x_{t-1,n}) = g(x_{t,n}; D x_{t-1,n}, \Lambda_n)$. Since there is no prior knowledge
about the starting point of the track, we assume a "flat" Gaussian distribution over $x_{t-L,n}$, namely
$p_b(x_{t-L,n}) = g(x_{t-L,n}; m_b, \Gamma_b)$, which is approximately equivalent to a uniform distribution over the
image. Consequently, the joint observation distribution writes:


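$$
\begin{aligned}
\tau_0 &= p(y_{t,k_0}, \ldots, y_{t-L,k_L}) \\
&= \int p(y_{t,k_0}, \ldots, y_{t-L,k_L}, x_{t:t-L,n}) \, dx_{t:t-L,n} \\
&= \int \prod_{l=0}^{L} p(y_{t-l,k_l} | x_{t-l,n}) \times \prod_{l=0}^{L-1} p(x_{t-l,n} | x_{t-l-1,n}) \times p_b(x_{t-L,n}) \, dx_{t:t-L,n},
\end{aligned}
\tag{27}
$$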


which can be seen as the marginal of a multivariate Gaussian distribution. Therefore, the joint observation
distribution $p(y_{t,k_0}, y_{t-1,k_1}, \ldots, y_{t-L,k_L})$ is also Gaussian and can be explicitly computed.

The model of the hypothesis "the considered observation sequence is generated by clutter" is based on
the observation model for clutter given earlier. When the considered observation sequence $\{y_{t,k_0}, \ldots, y_{t-L,k_L}\}$ is
generated by clutter, observations are independent and identically uniformly distributed. In this case,
the joint observation likelihood is

$$
\tau_1 = p(y_{t,k_0}, \ldots, y_{t-L,k_L}) = \prod_{l=0}^{L} u(y_{t-l,k_l}).
\tag{28}
$$

2 In practice we considered L = 2; however, the derivations are valid for arbitrary values of L.


Finally, our birth process is as follows: for all $y_{t,k_0}$ such that $\tau_0 > \tau_1$, a new person is added by
setting $e_{tn} = 1$, with $\Gamma_{tn}$ set to the value of a birth
covariance matrix (see (20)). Also, the reference appearance model for the new person is defined as
$h_{t,k_0}$.
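The birth test can be illustrated with a deliberately simplified sketch. It assumes scalar positions, a random-walk state (D = 1, P = 1), and hypothetical values for the process variance `lam`, the observation variance `sigma2`, and the image width; none of these values come from the paper. Under these assumptions the Gaussian marginal $\tau_0$ is accumulated with the Kalman prediction-error decomposition, the clutter likelihood $\tau_1$ is a product of uniform densities, and a birth is declared when $\tau_0 > \tau_1$.

```python
import math


def log_gauss(y, mean, var):
    """Log-density of a scalar Gaussian N(y; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def log_tau0(ys, m_b, gamma_b, lam, sigma2):
    """Log-likelihood of the 'person not being tracked' hypothesis.

    ys is ordered from oldest (y_{t-L}) to newest (y_t). The state is a scalar
    random walk x_l = x_{l-1} + w, w ~ N(0, lam), started from the flat prior
    N(m_b, gamma_b); each observation is y_l = x_l + v, v ~ N(0, sigma2).
    """
    mean, var = m_b, gamma_b
    total = 0.0
    for step, y in enumerate(ys):
        if step > 0:
            var += lam                   # prediction with D = 1
        innov_var = var + sigma2         # variance of y given earlier observations
        total += log_gauss(y, mean, innov_var)
        gain = var / innov_var           # scalar Kalman gain
        mean += gain * (y - mean)
        var *= 1.0 - gain
    return total


def log_tau1(ys, image_width):
    """Log-likelihood of the clutter hypothesis: i.i.d. uniform on [0, image_width]."""
    return -len(ys) * math.log(image_width)


def is_birth(ys, image_width=640.0, lam=25.0, sigma2=4.0):
    """Return True when the consistent-trajectory hypothesis wins (tau_0 > tau_1)."""
    # Hypothetical flat prior: centred on the image, standard deviation = image width.
    m_b, gamma_b = image_width / 2.0, image_width ** 2
    return log_tau0(ys, m_b, gamma_b, lam, sigma2) > log_tau1(ys, image_width)
```

With these assumed settings, a smooth sequence such as `[100.0, 102.0, 104.0]` passes the test, whereas a scattered one such as `[50.0, 600.0, 300.0]` is attributed to clutter. Working in the log domain avoids underflow when the sequence is inconsistent.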


5.2 Person-Visibility Process 

A tracked person is said to be visible at time t whenever there are observations associated to that 
person, otherwise the person is considered not visible. Instead of deleting tracks, as classical for death 
processes, our model labels tracks without associated observations as sleeping. In this way, we keep 
the possibility to awake such sleeping tracks when their reference appearance model highly matches 
an observed appearance. 

We denote the n-th person visibility (binary) variable by $V_{tn}$, meaning that the person is visible at
time t if $V_{tn} = 1$ and not visible if $V_{tn} = 0$. We assume the existence of a transition model for the hidden visibility
variable $V_{tn}$. More precisely, the temporal evolution of the visibility state is governed by the transition matrix
$p(V_{tn} = j | V_{t-1,n} = i) = \pi_v^{\delta_{ij}} (1 - \pi_v)^{1 - \delta_{ij}}$, where $\pi_v$ is the probability to remain in the same state.
To enforce temporal smoothness, the probability to remain in the same state is taken higher than the
probability to switch to another state.

The goal now is to estimate the visibility of all the persons. For this purpose we define the visibility
observations $\nu_{tn}$, which are 0 when no observation is associated to person n. In practice,
we need to filter the visibility state variables $V_{tn}$ using the visibility observations $\nu_{tn}$. In other words,
we need to estimate the filtering distribution $p(V_{tn} | \nu_{1:t,n}, e_{1:t})$, which can be written as:

$$
p(V_{tn} | \nu_{1:t,n}, e_{1:t}) = \frac{p(\nu_{tn} | V_{tn}, e_{tn}) \sum_{V_{t-1,n}} p(V_{tn} | V_{t-1,n}) \, p(V_{t-1,n} | \nu_{1:t-1,n}, e_{1:t-1})}{p(\nu_{tn} | \nu_{1:t-1,n}, e_{1:t})},
\tag{29}
$$

where the denominator corresponds to integrating the numerator over $V_{tn}$. In order to fully specify the
model, we define the visibility observation likelihood as:

$$
p(\nu_{tn} | V_{tn}, e_{tn}) = \left(\exp(-\lambda \nu_{tn})\right)^{1 - V_{tn}} \left(1 - \exp(-\lambda \nu_{tn})\right)^{V_{tn}}.
\tag{30}
$$

Intuitively, when $\nu_{tn}$ is high, the likelihood is large if $V_{tn} = 1$ (person is visible). The opposite behavior
is found when $\nu_{tn}$ is small. Importantly, at each frame, because the visibility state is a binary variable,
its filtering distribution can be straightforwardly computed.
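Because the visibility state is binary, the filtering recursion reduces to a two-state HMM forward pass. The sketch below assumes a symmetric transition model with stay probability `pi_v`, an exponential visibility likelihood with rate `lam`, and a uniform initial belief; the function name and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def visibility_filter(nus, pi_v=0.8, lam=5.0, prior_visible=0.5):
    """Forward filtering of a binary visibility state.

    nus: sequence of non-negative visibility observations (0 means that no
         observation is associated to the person at that frame).
    Returns the list of filtered probabilities p(V_t = 1 | nu_1..nu_t).
    """
    belief = prior_visible
    filtered = []
    for nu in nus:
        # Prediction: stay in the same state with probability pi_v.
        pred_visible = pi_v * belief + (1.0 - pi_v) * (1.0 - belief)
        pred_hidden = 1.0 - pred_visible
        # Likelihood: exp(-lam * nu) if not visible, 1 - exp(-lam * nu) if visible.
        lik_hidden = math.exp(-lam * nu)
        lik_visible = 1.0 - lik_hidden
        # Update: normalise the two-term numerator.
        num_visible = lik_visible * pred_visible
        num_hidden = lik_hidden * pred_hidden
        belief = num_visible / (num_visible + num_hidden)
        filtered.append(belief)
    return filtered
```

For example, `visibility_filter([1.0, 1.0, 0.0, 0.0])` yields a belief close to 1 for the first two frames and a belief of 0 as soon as the visibility observation vanishes, since the assumed likelihood assigns zero mass to a visible person with no associated observation.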


6 Experiments 

6.1 Evaluation Protocol 


We experimentally assess the performance of the proposed model using two datasets. The cocktail party
dataset (CPD) is composed of two videos, CPD-2 and CPD-3, recorded with a close-view camera (see
Figures 3(a) and 3(b)). Only people's upper bodies are visible, and mutual occlusions happen often. CPD-3
records 3 persons during 853 frames and CPD-2 records 2 persons during 495 frames.

The second dataset consists of four sequences classically used in computer vision to evaluate
multi-person tracking methods. Two sequences were selected from the MOT Challenge Dataset
[39] (see footnote 3): TUD-Stadtmitte (9 persons, 179 frames) and PETS09-S2L1 (18 persons, 795 frames). The third
sequence is the TownCentre sequence (231 persons, 4500 frames) recorded by the Oxford Active Vision
Lab. The last one is ParkingLot (14 persons, 749 frames) recorded by the Center for Research in Computer
Vision of the University of Central Florida. TUD-Stadtmitte records closely viewed full-body pedestrians.
PETS09-S2L1 and ParkingLot feature about a dozen far-viewed full-body pedestrians. TownCentre
captures a very large number of far-viewed pedestrians. This evaluation dataset is diverse and large
enough (more than 6000 frames) to give a reliable assessment of the multi-person tracking performance
measures. Figure 3 shows typical views of all the sequences.

Fig. 3 Typical images extracted from the sequences used for tracking evaluation. Figures 3(a) and 3(b) are from
the Cocktail-Party Dataset. Figures 3(c), 3(d), 3(e), and 3(f) display sample images from PETS09S2L1, TUD-Stadtmitte,
ParkingLot, and TownCentre, which are classically used in computer vision to evaluate multi-person tracking.

Because multi-person tracking intrinsically implies track creation, deletion, target identity maintenance,
and localization, evaluating multi-person tracking models is a non-trivial task. Many metrics
have been proposed, see [41, 42, 43]. In this paper, for the sake of completeness we use several of
them, split into two groups.

The first set of metrics follows the widely used CLEAR multi-person tracking evaluation metrics
[42], which evaluate multi-target tracking where targets' identities are jointly
estimated together with their kinematic states. On the one hand, the multi-object tracking accuracy
(MOTA) combines false positives (FP), missed targets (FN), and identity switches (ID). On the
other hand, the multi-object tracking precision (MOTP) measures the alignment of the tracker output
bounding box with the ground truth. We also provide tracking precision (Pr) and recall (Rc).
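As an illustration of how FP, FN, and ID combine, the sketch below uses the standard CLEAR definition MOTA = 1 - (FN + FP + ID) / GT, where GT is the total number of ground-truth objects over all frames. The function name and the example counts are hypothetical, and the precision and recall formulas are the usual detection-level definitions, which is an assumption about how Pr and Rc are computed here. MOTP is omitted because it requires the per-match bounding-box overlaps.

```python
def clear_mot_summary(num_gt, false_pos, misses, id_switches):
    """Compute MOTA, recall and precision from counts accumulated over a sequence.

    num_gt:      total number of ground-truth objects over all frames
    false_pos:   tracker outputs matched to no ground-truth object (FP)
    misses:      ground-truth objects matched to no tracker output (FN)
    id_switches: number of identity switches (ID)
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    true_pos = num_gt - misses
    mota = 1.0 - (misses + false_pos + id_switches) / num_gt
    recall = true_pos / num_gt
    detected = true_pos + false_pos
    precision = true_pos / detected if detected else 0.0
    return {"MOTA": mota, "Rc": recall, "Pr": precision}
```

Note that MOTA can become negative when the tracker makes more errors than there are ground-truth objects, which is why it is reported alongside precision and recall.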

The second group of metrics is specifically designed for multi-target tracking models that do not
estimate the targets' identities, such as the PHD filter. These metrics compute set distances between
the ground truth set of objects present in the scene and the set of objects estimated by the tracker.
The metrics are the Hausdorff metric, the optimal mass transfer (OMAT) metric, and the optimal
sub-pattern assignment (OSPA) metric. We will use these metrics to compare the tracking results
achieved by our variational tracker to the results achieved by the PHD filter, which does not infer
identities [44].
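To make the set-based metrics concrete, here is a minimal sketch of the Hausdorff and OSPA distances between two sets of 2-D points for a single frame. It assumes Euclidean distances between object centres, and the OSPA cut-off `cutoff` and order `order` are hypothetical defaults, not the values used in the experiments. The optimal assignment is found by brute force, which is only practical for a handful of objects; OMAT is not covered.

```python
import itertools
import math


def hausdorff(xs, ys):
    """Hausdorff distance between two finite point sets (lists of tuples)."""
    if not xs and not ys:
        return 0.0
    if not xs or not ys:
        return math.inf                  # undefined for one empty set; inf by convention here

    def directed(a, b):
        # Largest distance from a point of a to its nearest neighbour in b.
        return max(min(math.dist(p, q) for q in b) for p in a)

    return max(directed(xs, ys), directed(ys, xs))


def ospa(xs, ys, cutoff=20.0, order=1.0):
    """OSPA distance with cut-off c and order p (brute-force assignment)."""
    if len(xs) > len(ys):
        xs, ys = ys, xs                  # ensure m = |xs| <= n = |ys|
    m, n = len(xs), len(ys)
    if n == 0:
        return 0.0
    # Best localisation cost over all injective assignments of xs into ys.
    best = min(
        sum(min(cutoff, math.dist(x, ys[j])) ** order for x, j in zip(xs, perm))
        for perm in itertools.permutations(range(n), m)
    )
    # Each unassigned object costs cutoff ** order (cardinality penalty).
    return ((best + cutoff ** order * (n - m)) / n) ** (1.0 / order)
```

Unlike the Hausdorff distance, which is dominated by the single worst outlier, OSPA bounds each localisation error by the cut-off and adds an explicit penalty for a wrong number of objects.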

The computational cost of the proposed model is mainly due to the observation extraction, namely
the person detection. This process is known in computer vision to be computationally intensive. However,
there are pedestrian detectors that achieve real-time performance [45]. The VEM part of the
tracking model, which involves only the inversion of 6 by 6 matrices, is computationally efficient and can
be made real time. It converges in less than 10 steps.

3 http://motchallenge.net/

Fig. 4 Region splitting for computing the color histograms: Fig. 4(a) shows an example with upper-body detection while
Fig. 4(b) shows an example of full-body detection.

Sequence   Features   Rc          Pr          MOTA        MOTP
CPD-2      u/uc       53.3/70.7   94.9/99.4   46.6/64.3   80.8/85.8
CPD-2      f/fc       89.8/90.1   94.6/94.6   75.7/76.0   76.6/76.7
CPD-2      fu/fuc     93.1/95.2   95.3/96.2   88.3/80.0   76.5/82.9
CPD-3      u/uc       93.6/93.6   94.4/99.6   91.6/91.8   85.0/86.8
CPD-3      f/fc       62.5/62.8   97.6/98.4   58.9/59.7   68.5/68.4
CPD-3      fu/fuc     91.0/92.6   99.4/99.7   88.3/90.1   76.5/82.9

Table 1 Evaluation of the proposed multi-person tracking method with different features on the two sequences of the
cocktail party dataset. All measures are in %.


6.2 Validation on the Cocktail Party Dataset 


In the cocktail party dataset our model exploits upper body detections obtained using [25] and face
detections obtained using [26]. Therefore, we have two types of observations, upper body (u) and face (f).
The hidden state corresponds to the position and velocity of the upper body. The observation operator
(see Section 3.2.1) for the upper body observations simply removes the velocity components of the
hidden state. The observation operator for the face observations combines a projection removing
the velocity components and an affine mapping (scaling and translation) transforming face localization
bounding boxes into the upper body localization bounding boxes. The appearance observations are
concatenations of joint hue-saturation color histograms of the torso split into three different regions,
plus the head region, as shown in Fig. 4(a).

Tables 1 and 2 show the performance of the model over the two sequences of the cocktail party
dataset. While in Table 1 we evaluate the performance of our model under the first set of metrics,
in Table 2 we compare the performance of our model to that of the GMM PHD filter using the
set-based metrics. Regarding the detectors, we evaluate the performance when using (i) upper body
detectors, (ii) face detectors or (iii) both. For each of these three choices, we also compare the results
with and without color histogram descriptors. From now on, u and f denote the use of
upper-body detectors and face detectors respectively, while c denotes the use of color histograms.

Results in Table 1 show that for the sequence CPD-2, while Pr and MOTP are higher when using
upper-body detections (u/uc), Rc and MOTA are higher when using face detections (f/fc). One may
think that the representation power of both detections may be complementary to each other. This
is evidenced in the third row of Table 1, where both detectors are used and the performances are
higher than in the first two rows, except for Pr and MOTP when using color. Regarding CPD-3, we
clearly notice that the use of upper-body detections is much more advantageous than using the face
detector. Importantly, even if the performance reported by the combination of the two detectors does
not significantly outperform the one reported when using only the upper-body detectors, it exhibits
significant gains when compared to using only face detectors. The use of color seems to be advantageous
in most of the cases, independently of the sequence and the detections used. Summarizing, while the
performance of the method using only face detections or upper-body detections seems to be
sequence-dependent, there is a clear advantage of using the feature combinations. Indeed, the combination seems
to perform comparably to the best of the two detectors and much better than the worst. Therefore, the
use of the combined detection appears to be the safest choice in the absence of any other information,
which justifies developing a model able to handle observations coming from multiple detectors.


Sequence   Method-Features   Hausdorff     OMAT             OSPA
CPD-2      VEM-u/uc          239.4/239.2   326.5/343.1      247.8/244.5
CPD-2      PHD-u             276.6         435.3            567
CPD-2      VEM-f/fc          116.3/115.5   96.3/96.1        110.9/108.0
CPD-2      PHD-f             124           102              185.8
CPD-2      VEM-fu/fuc        98.0/97.7     80.3/7           92.7/90.6
CPD-2      PHD-fu            95            80               168
CPD-3      VEM-u/uc          56.0/56.2     44.4/44.2        54.7/54.1
CPD-3      PHD-u             162.2         244.6            382.6
CPD-3      VEM-f/fc          184.2/185.5   200.8/201.3436   203.3/205.0
CPD-3      PHD-f             208           239.5            445.2
CPD-3      VEM-fu/fuc        66.3/67.4     52.7/52.8        68.5/68.0
CPD-3      PHD-fu            49            54.4             181

Table 2 Set metric based multi-person tracking performance measures of the proposed VEM and of the GMM PHD
filter [44] on the cocktail party dataset.


Table 2 reports a comparison of the proposed VEM model with the PHD filter for different features
under the set metrics over the two sequences of the cocktail party dataset. We first observe that the
behavior described from the results of Table 1 is also observed here, for a different group of measures
and also for the PHD filter. In absolute terms, while the use of the face or of the upper-body detections may be
slightly more advantageous than the combination of detectors, this is sequence- and measure-dependent.
However, the gain of the combination over the less reliable detector is very large, thus justifying the
multiple-detector strategy when the applicative scenario allows for it and no other information about
the sequence is available. The second observation is that the proposed VEM outperforms the PHD
filter almost everywhere (i.e. except for CPD-3 with fu/fuc under the Hausdorff measure). This
systematic trend demonstrates the potential of the proposed method from an experimental point of
view. One possible explanation may be that the variational tracker exploits additional information as
it jointly estimates the target kinematic states together with their identities.

Figure 5 gives the histograms of the absolute errors made by the variational tracking model when
estimating the number of persons. These results show that, over the Cocktail Party Dataset, the number
of people present in the visual scene in a given time frame is in general correctly estimated. This
shows that the birth and the visibility processes play their role in creating tracks when new people enter
the scene, and in handling people who are occluded or leave the scene. More than 80% of the time, the
number of people is correctly estimated. It has to be noticed that errors are slightly higher for the
sequence involving three persons than for the sequence involving two persons.
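The kind of histogram discussed above can be computed from per-frame person counts as sketched below. The function name and the toy counts are hypothetical; the sketch only illustrates the statistic (normalized frequency of each absolute counting error), not the authors' evaluation code.

```python
from collections import Counter


def count_error_histogram(true_counts, estimated_counts):
    """Normalized histogram of |true - estimated| person counts over frames.

    Returns a dict mapping each absolute error to the fraction of frames
    in which it occurs (keys sorted in increasing order).
    """
    if len(true_counts) != len(estimated_counts):
        raise ValueError("both sequences must cover the same frames")
    errors = Counter(abs(t - e) for t, e in zip(true_counts, estimated_counts))
    num_frames = len(true_counts)
    return {err: count / num_frames for err, count in sorted(errors.items())}
```

For instance, with true counts `[2, 2, 3, 3, 3]` and estimated counts `[2, 2, 3, 2, 3]`, the entry for error 0 is 0.8, meaning the number of people is exactly right in 80% of the frames.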

To give a qualitative flavor of the tracking performance, Figure 6 gives sample results achieved by
the proposed model (VEM-fuc) on CPD-3. These images show that the model is able to correctly
initialize new tracks, identify occluded people as no longer visible, and recover their identities after
occlusion. Tracking results are provided as supplementary material.

Figure 7 gives the estimated targets' visibility probabilities (see Section 5.2) for sequence CPD-3,
with sample tracking images given in Figure 6. The person visibility probabilities show that tracking for persons 1
and 2 starts at the beginning of the sequence, and person 3 arrives at frame 600. Also, person 1 is
occluded between frames 400 and 450 (see fourth image in the first row, and first image in the second
row of Figure 6).

Fig. 5 Histogram of absolute errors about the estimation of the number of people present in the visual scene over the
Cocktail Party Dataset: (a) CPD-2, (b) CPD-3.

Fig. 6 Sample tracking results on CPD-3. The green bounding boxes represent the face detections and the yellow
bounding boxes represent the upper body detections. Importantly, the red bounding boxes display the tracking results.

Fig. 7 Estimated visibility probabilities for tracked persons in sequence CPD-3. Every row displays the corresponding
target's visibility probabilities for every time frame. Yellow color represents very high probabilities (close to 1), and blue
color represents very low probabilities.


6.3 Evaluation on classical computer vision video sequences 

In this tracking situation, we model a single person's kinematic state as the full body bounding box
and its velocity. In this case, the observation operator P simply removes the velocity information,
keeping only the bounding box's position and size. The appearance observations are the concatenation
of the joint HS histograms of the head, torso and legs areas (see Figure 4(b)).

Sileye Ba et al.


Sequence         Method-Features   Hausdorff     OMAT          OSPA
TUD-Stadtmitte   VEM-b/bc          150.4/125.9   197.5/184.9   483.2/482.4
TUD-Stadtmitte   PHD-b             184.7         119           676
PETS09S2L1       VEM-b/bc          52.1/50.9     72.6/40.8     117.0/110.1
PETS09S2L1       PHD-b             70            44            163
TownCentre       VEM-b/bc          420./391.2    205.4/177.5   350.0/335.2
TownCentre       PHD-b             430.5         173.8         364.9
ParkingLot       VEM-b/bc          95.0/90.5     87.9/83.9     210.8/203.4
ParkingLot       PHD-b             169           94.0          415

Table 3 Set metric based multi-person tracking performance measures on the four sequences
PETS09S2L1, TownCentre, ParkingLot, and TUD-Stadtmitte.


We evaluate our model using only body localization observations (b) and jointly using body localization
and color appearance observations (bc). Table 3 compares the proposed variational model
to the PHD filter using set-based distance performance metrics. As for the cocktail party dataset, in
general, these results show that the variational tracker outperforms the PHD filter.

In addition, we also compare the proposed model to two tracking models, proposed by Milan et al.
and by Bae and Yoon. Importantly, the direct comparison of our model to these two
state-of-the-art methods must be done with care. Indeed, while the proposed VEM uses only causal
(past) information, these two methods use both past and future detections. In other words, while ours
is a filtering method, theirs are smoothing methods. Therefore, we expect these two models to outperform the
proposed one. However, the main advantage of filtering methods over smoothing methods,
and therefore of the proposed VEM over these two methods, is that while smoothing methods are
inherently unsuitable for on-line processing, filtering methods are naturally appropriate for on-line
tasks, since they only use causal information.

Table 4 reports the performance of these methods on four sequences classically used in computer
vision to evaluate multi-target trackers. In this table, results over TUD-Stadtmitte show similar
performances for our model with and without appearance information. Therefore, color information is not very
informative in this sequence. In PETS09-S2L1, our model using color achieves better MOTA,
precision, and recall, showing the benefit of integrating color into the model. As expected, Milan et al.
and Bae and Yoon outperform the proposed model. However, the non-causal nature of their methods
makes them unsuitable for on-line tracking tasks, where the observations must be processed as they are
received.

Figure 8 gives the histograms of the errors about the number of people present in the visual scene
for the four sequences ParkingLot, TownCentre, PETS09-S2L1, and TUD-Stadtmitte. These results show
that the four sequences are more challenging than the Cocktail Party Dataset (see Figure 5). Among
the four video sequences, TUD-Stadtmitte is the one where the number of people estimated by the
variational tracking model is the least consistent. This can be explained by the quality of the
observations (detections) over this sequence. For the PETS09-S2L1 and ParkingLot sequences, which involve
about 15 persons, about 70% of the time the proposed tracking model estimates the number of
people in the scene with an error below 2 persons. For the TownCentre sequence, which involves 231
persons over 4500 frames, over 70% of the time the error made by the variational tracker is below 7
persons. This shows that, even in challenging situations involving occlusions due to crowds, the birth
and the visibility processes play their role.

Figure 9 presents sample results for the PETS09-S2L1 sequence. In addition, videos presenting the
results on the second dataset are provided as supplementary material. These results show temporally
consistent tracks. Occasionally, person identity switches may occur when two people cross. Remarkably,
because the proposed tracking model is allowed to reuse the identity of persons visible in the past,
people re-entering the scene after having left will be recognized and the previously used track will be
awakened.

Sequence         Method         Rc          Pr          MOTA        MOTP
TUD-Stadtmitte   VEM-b/bc       72.2/70.9   81.7/82.5   54.8/53.5   65.4/65.1
TUD-Stadtmitte   Milan et al.   84.7        86.7        71.5        65.5
PETS09-S2L1      VEM-b/bc       90.1/90.2   86.2/87.6   74.9/76.7   71.8/71.8
PETS09-S2L1      Milan et al.   92.4        98.4        90.6        80.2
PETS09-S2L1      Bae and Yoon   -           -           83          69.5
TownCentre       VEM-b/bc       88.1/90.1   71.5/72.7   72.7/70.9   74.9/76.1
ParkingLot       VEM-b/bc       80.3/78.3   85.2/87.5   73.1/74     70.8/71.7

Table 4 Performance measures on the sequences of the second dataset. Comparison with the methods of Milan et al. and
Bae and Yoon must be done with care since both are smoothing methods and therefore use more information than the
proposed VEM.

Fig. 8 Histogram of errors about the estimation of the number of people present in the visual scene over ParkingLot,
TownCentre, PETS09-S2L1, TUD-Stadtmitte. Panels include (c) ParkingLot and (d) TUD-Stadtmitte.


7 Conclusions 

We presented an on-line variational Bayesian model to track a time-varying number of persons from
cluttered multiple visual observations. To our knowledge, this is the first variational Bayesian model
for tracking multiple persons, or more generally, multiple targets. We proposed birth and visibility
processes to handle persons that are entering and leaving the visual field. The proposed model is
evaluated with two datasets, showing competitive results with respect to state-of-the-art multi-person
tracking models. Remarkably, even if in the conducted experiments we model the visual appearance
with color histograms, our framework is versatile enough to accommodate other visual cues such as
texture, feature descriptors or motion cues.

Fig. 9 Tracking results on PETS09-S2L1. Green boxes represent observations and red bounding boxes represent tracking
outputs associated with person identities. Green and red bounding boxes may overlap.

In the future we plan to consider the integration of more sophisticated birth processes than the one 
considered in this paper, e.g. [46]. We also plan to extend the visual tracker to incorporate auditory cues. 
For this purpose, we plan to jointly track the kinematic states and the speaking status (active/passive) 
of each tracked person. The framework proposed in this paper allows exploiting audio features, e.g. voice
activity detection and audio-source localization, as observations. When using audio information, robust
voice descriptors (the acoustic equivalent of visual appearance) and their blending with the tracking 
model will be investigated. We also plan to extend the proposed formalism to a moving camera such 
that its kinematic state is tracked as well. This case is of particular interest in applications such as 
pedestrian tracking for self-driving cars or for human-robot interaction. 


A Derivation of the Variational Formulation 


A.1 Filtering Distribution Approximation


The goal of this section is to derive an approximation of the hidden-state filtering distribution $p(Z_t, X_t | o_{1:t}, e_{1:t})$, given
the variational approximating distribution $q(Z_{t-1}, X_{t-1})$ at t - 1. Using Bayes rule, the filtering distribution can be
written as

$$
p(Z_t, X_t | o_{1:t}, e_{1:t}) = \frac{p(o_t | Z_t, X_t, e_t) \, p(Z_t, X_t | o_{1:t-1}, e_{1:t})}{p(o_t | o_{1:t-1}, e_{1:t})}.
\tag{31}
$$

It is composed of three terms: the likelihood $p(o_t | Z_t, X_t, e_t)$, the predictive distribution $p(Z_t, X_t | o_{1:t-1}, e_{1:t})$, and the
normalization factor $p(o_t | o_{1:t-1}, e_{1:t})$, which is independent of the hidden variables. The likelihood can be expanded as:

$$
p(o_t | Z_t, X_t, e_t) = \prod_{i=1}^{I} \prod_{k \le K_t^i} \prod_{n=0}^{N} p(o_{tk}^i | Z_{tk}^i = n, X_t, e_t)^{\delta_n(Z_{tk}^i)},
\tag{32}
$$

where $\delta_n$ is the Dirac delta function, and $p(o_{tk}^i | Z_{tk}^i = n, X_t, e_t)$ is the individual observation likelihood defined
earlier.

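The predictive distribution factorizes as

$$
p(Z_t, X_t | o_{1:t-1}, e_{1:t}) = p(Z_t | e_t) \, p(X_t | o_{1:t-1}, e_{1:t}).
$$

Exploiting its multinomial nature, the assignment variable distribution $p(Z_t | e_t)$ can be fully expanded as:

$$
p(Z_t | e_t) = \prod_{i=1}^{I} \prod_{k \le K_t^i} \prod_{n=0}^{N} p(Z_{tk}^i = n | e_t)^{\delta_n(Z_{tk}^i)}.
\tag{33}
$$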


Using the motion state dynamics definition $p(x_{tn} | x_{t-1,n}, e_{tn})$ and the variational approximation of the previous-time
motion state filtering distribution, $q(x_{t-1,n} | e_{t-1,n}) \approx p(x_{t-1,n} | o_{1:t-1}, e_{1:t-1})$, defined in (20), the motion state
predictive distribution $p(X_t = x_t | o_{1:t-1}, e_{1:t})$ can be approximated by

$$
\begin{aligned}
p(X_t = x_t | o_{1:t-1}, e_{1:t})
&= \int p(x_t | x_{t-1}, e_t) \, p(x_{t-1} | o_{1:t-1}, e_{1:t-1}) \, dx_{t-1} \\
&= \int \left( \prod_{n=1}^{N} p(x_{tn} | x_{t-1,n}, e_{tn}) \right) p(x_{t-1} | o_{1:t-1}, e_{1:t-1}) \, dx_{t-1} \\
&\approx \int \prod_{n=1}^{N} p(x_{tn} | x_{t-1,n}, e_{tn}) \, q(x_{t-1,n} | e_{t-1,n}) \, dx_{t-1,1} \ldots dx_{t-1,N} \\
&\approx \prod_{n=1}^{N} u(x_{tn})^{1 - e_{tn}} \, g(x_{tn}; D \mu_{t-1,n}, D \Gamma_{t-1,n} D^\top + \Lambda_n)^{e_{tn}},
\end{aligned}
\tag{34}
$$

where during the derivation, the filtering distribution of the kinematic state at time t - 1 is replaced by its variational
approximation $p(x_{t-1} | o_{1:t-1}, e_{1:t-1}) \approx \prod_{n=1}^{N} q(x_{t-1,n} | e_{t-1,n})$.

Equations (32), (33), and (34) define the numerator of the tracking filtering distribution. The logarithm of this
filtering distribution is used by the proposed variational EM algorithm.
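The Gaussian form of the predictive distribution rests on a standard marginalization identity, written out here for a single existing person ($e_{tn} = 1$). It assumes, as in the dynamical model, that the state evolves linearly with independent Gaussian noise of covariance $\Lambda_n$, and that the previous posterior is Gaussian with mean $\mu_{t-1,n}$ and covariance $\Gamma_{t-1,n}$:

```latex
\int g(x_{tn}; D x_{t-1,n}, \Lambda_n) \, g(x_{t-1,n}; \mu_{t-1,n}, \Gamma_{t-1,n}) \, dx_{t-1,n}
  = g\left(x_{tn}; D \mu_{t-1,n}, \; D \Gamma_{t-1,n} D^\top + \Lambda_n\right)
```

The identity follows from writing $x_{tn} = D x_{t-1,n} + w$ with $w \sim \mathcal{N}(0, \Lambda_n)$ independent of $x_{t-1,n}$: the mean of $x_{tn}$ is $D \mu_{t-1,n}$ and the two covariance contributions, $D \Gamma_{t-1,n} D^\top$ and $\Lambda_n$, add.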
A.2 Derivation of the E-Z-Step


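The E-Z-step corresponds to the estimation of $q(Z_{tk}^i | e_t)$ which, from the log of the filtering distribution,
can be written as:

$$
\log q(Z_{tk}^i | e_t) = \sum_{n=0}^{N} \delta_n(Z_{tk}^i) \, E_{q(x_{tn} | e_t)}\left[ \log\left( p(y_{tk}^i, h_{tk}^i | Z_{tk}^i = n, X_t, e_t) \, p(Z_{tk}^i = n | e_t) \right) \right] + C,
\tag{35}
$$

where C gathers terms that are constant with respect to the variable of interest, $Z_{tk}^i$ in this case. By substituting
$p(y_{tk}^i, h_{tk}^i | Z_{tk}^i = n, X_t, e_t)$ and $p(Z_{tk}^i = n | e_t)$ with their expressions, by introducing the notations

$$
\begin{aligned}
\epsilon_{tk0}^i &= u(y_{tk}^i) \, u(h_{tk}^i), \\
\epsilon_{tkn}^i &= g(y_{tk}^i; P^i \mu_{tn}, \Sigma^i) \exp\left( -\tfrac{1}{2} \mathrm{Tr}\left( P^{i\top} (\Sigma^i)^{-1} P^i \Gamma_{tn} \right) \right) b(h_{tk}^i; h_n),
\end{aligned}
$$

and after some algebraic derivations, the distribution of interest can be written as the following multinomial distribution

$$
q(Z_{tk}^i = n | e_t) = \alpha_{tkn}^i = \frac{p(Z_{tk}^i = n | e_t) \, \epsilon_{tkn}^i}{\sum_{m=0}^{N} p(Z_{tk}^i = m | e_t) \, \epsilon_{tkm}^i}.
\tag{36}
$$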
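The normalization in (36) can be sketched for a single scalar observation. This is a simplified illustration under stated assumptions: P = 1, a scalar observation variance `sigma2`, a constant clutter density, and no appearance term; the function names and the numbers in the usage note are hypothetical.

```python
import math


def gauss(y, mean, var):
    """Density of a scalar Gaussian N(y; mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def assignment_posterior(y, priors, mus, gammas, sigma2, clutter_density):
    """Posterior assignment probabilities alpha_n for one observation y.

    priors:  prior assignment probabilities for n = 0..N (index 0 is clutter)
    mus:     posterior means of the N persons
    gammas:  posterior variances of the N persons
    """
    # Clutter term (n = 0): uniform density over the observation space.
    eps = [clutter_density]
    for mu, gamma in zip(mus, gammas):
        # Expected log-likelihood under the Gaussian posterior: the Gaussian
        # evaluated at the mean, times a penalty for the posterior uncertainty.
        eps.append(gauss(y, mu, sigma2) * math.exp(-0.5 * gamma / sigma2))
    weights = [p * e for p, e in zip(priors, eps)]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with `y = 10.0`, persons at means `[10.5, 50.0]`, and a clutter density of `1 / 640`, almost all of the posterior mass goes to the first person, while the second person and clutter receive negligible weight.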


A.3 Derivation of the E-X-Step 

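The E-step for the motion state variables consists in the estimation of $q(x_{tn} | e_{tn})$ using the relation $\log q(x_{tn} | e_{tn}) =
E_{q(Z_t, X_t / x_{tn} | e_t)}[\log p(Z_t, X_t | o_{1:t}, e_{1:t})]$, which can be expanded as

$$
\begin{aligned}
\log q(x_{tn} | e_{tn}) = {} & \sum_{i=1}^{I} \sum_{k=0}^{K_t^i} E_{q(Z_{tk}^i | e_t)}\left[ \delta_n(Z_{tk}^i) \right] \log g(y_{tk}^i; P^i x_{tn}, \Sigma^i) \\
& + \log\left( u(x_{tn})^{1 - e_{tn}} \, g(x_{tn}; D \mu_{t-1,n}, D \Gamma_{t-1,n} D^\top + \Lambda_n)^{e_{tn}} \right) + C,
\end{aligned}
$$

where, as above, C gathers constant terms. After some algebraic derivation one obtains $q(x_{tn} | e_{tn}) = g(x_{tn}; \mu_{tn}, \Gamma_{tn})$,
where the mean and covariance of the Gaussian distribution are given by (21) and by (22).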
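The "algebraic derivation" amounts to completing the square in $x_{tn}$. The scalar sketch below illustrates the resulting precision-weighted update for an existing person ($e_{tn} = 1$), assuming scalar operators `d` and `p`, a single detector, and hypothetical parameter values; it is derived here from the expansion above and is expected, but not verified, to correspond to the scalar case of (21) and (22).

```python
def e_x_step_scalar(ys, alphas, sigma2, mu_prev, gamma_prev, d=1.0, lam=1.0, p=1.0):
    """Scalar posterior mean and variance of one person's state.

    ys:      observations at the current frame
    alphas:  posterior probabilities that each observation belongs to the person
    sigma2:  observation noise variance
    mu_prev, gamma_prev: posterior mean and variance at the previous frame
    d, lam:  scalar dynamics x_t = d * x_{t-1} + w, w ~ N(0, lam)
    p:       scalar observation operator
    """
    # Predictive Gaussian, i.e. the scalar analogue of the prediction step.
    pred_mean = d * mu_prev
    pred_var = d * gamma_prev * d + lam
    # Collect the quadratic and linear coefficients of the log-posterior in x.
    precision = 1.0 / pred_var + sum(alphas) * p * p / sigma2
    linear = pred_mean / pred_var + p * sum(a * y for a, y in zip(alphas, ys)) / sigma2
    gamma = 1.0 / precision
    mu = gamma * linear
    return mu, gamma
```

When all assignment probabilities are zero the update returns the prediction unchanged, and each observation otherwise pulls the mean toward itself in proportion to its assignment probability, which is how clutter is softly discounted.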